Noblis Team - Protovis and Processing
VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease
Authors and Affiliations:
Catherine Campbell, PhD, Noblis, Team Lead, catherine.campbell@noblis.org [PRIMARY contact]
Seth Blanchard, Noblis, seth.blanchard@noblis.org
Mitchell Holland, Noblis, mitchell.holland@noblis.org
Jill McCracken, Noblis, jill.mccracken@noblis.org
Harry Cummins, Noblis, graphic artist, hcummins@noblis.org
Richard P. DiMassimo, video producer, rdimassimo@noblis.org
Austin Blanton, Noblis Intern, austin.blanton@noblis.org
Noblis VAST Webpage: http://www.noblis.org/VAST
Tool(s):
Primary tools used for this project included:
1) SNUFER (http://www.bioinformation.net/003/001300032008.htm)
was used to generate SNP tables. It was developed in 2008 by Mozart
Marins' group at the Unidade de Biotecnologia in Sao Paulo, Brazil.
2) Clustal W (http://www.clustal.org/) was used
to align sequences and generate phylogenetic trees. This program was
developed at Trinity College, Dublin, in 1988.
3) Perl (http://www.perl.org/) was used to
develop scripts to organize tables.
4) R (http://www.r-project.org/), a
statistical package, was used to calculate the significance of SNPs.
5) Processing (http://processing.org/) was used to
develop interactive SNP plots. This is an open source design language
started by Ben Fry and Casey Reas in 2001.
6) Protovis (http://vis.stanford.edu/protovis/)
was used to develop interactive pedigree trees in a space-saving
format. This is a JavaScript-based visualization toolkit developed at
Stanford and released as an open source tool in 2009.
With the exception of SNUFER and Clustal W, which can be used by
general biologists and bioinformaticians, these toolkits require some
programming ability. Perl and Processing are easy to start scripting
in, but complex visualizations take time to learn. Once the scripts are
developed, however, any end user (biologist, analyst, or
bioinformatician) can interact with the visualizations with little or
no training. Protovis has a longer learning curve than Processing, but
its interactive visualizations can likewise be used by any end user. R
requires the most training, in both statistics and scripting. It can be
adapted for non-programmers through the development of user-friendly
interfaces.
Video:
Noblis_Processing_MC3.mp4
ANSWERS:
MC3.1: What
is the region or country of origin for the current outbreak? Please
provide your answer as the name of the native viral strain along with a
brief explanation.
We identified Nigeria as the origin of the current outbreak. Twelve
SNPs differentiate Nigeria_B from the closest outbreak sequence (531).
We used SNUFER to identify SNPs, Perl to organize tables, and a
pedigree analysis (Figure 3.1.1), drawn manually in Microsoft Visio, to
visualize lineage. The pedigree shows the Drafa Fever virus lineage in
Africa and links Nigeria_B to sequence 531. It is scaled vertically to
the number of SNPs that change between sequences, which is also given
by the numbers in the arrows. We also plotted the temporal and
geographic progression of the disease across Africa (Figure 3.1.2)
based on the data in the pedigree. Numbers on the map correspond to
those on the pedigree, and circles are scaled to the geographic area
named in each sequence. The entire analysis and visualization required
approximately 8 hours.
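The comparison behind "twelve SNPs differentiate Nigeria_B from the closest outbreak sequence" can be sketched as a position-by-position count over aligned sequences. This is only an illustrative Python sketch, not the SNUFER/Perl pipeline we actually ran; the function names and the toy sequences are hypothetical and are not challenge data.

```python
def snp_count(seq_a, seq_b):
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def closest_strain(query, natives):
    """Return (name, snp_count) for the native strain with the fewest differences."""
    scored = ((name, snp_count(query, seq)) for name, seq in natives.items())
    return min(scored, key=lambda pair: pair[1])


# Toy, made-up sequences (NOT challenge data), only to show the call shape.
natives = {
    "strain_X": "ACGAACGTAC",
    "strain_Y": "ACGTTCGTAA",
}
outbreak = "ACGTTCGTAC"
best = closest_strain(outbreak, natives)
print(best)
```

Ties are resolved by dictionary order here; a real analysis would need to inspect tied candidates rather than pick one silently.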

Figure 3.1.1. Pedigree Tree of Native Strains

Figure 3.1.2. Temporal and Geographic
Progression of Drafa Fever Virus
MC3.2: Over
time, the virus spreads and the diversity of the virus increases as it
mutates. Two patients infected with the Drafa virus are in the
same hospital as Nicolai. Nicolai has a strain identified by
sequence 583. One patient has a strain identified by sequence 123
and the other has a strain identified by sequence 51. Assume only
a single viral strain is in each patient. Which patient likely
contracted the illness from Nicolai and why? Please provide your
answer as the sequence number along with a brief explanation.
The patient with sequence 123 contracted Drafa virus from Nicolai
Kuryakin (sequence 583). Sequence 123 differs from sequence 583 by a
single SNP, indicating direct descent. A two-person team spent four
days creating two analytic visualizations: a sunburst plot using
Protovis (Figure 3.2.1) and a polar plot using Processing (Figure
3.2.2). The sunburst plot is a novel, space-saving way to depict
pedigrees; it pulls data from a table of SNPs and automates the
visualization process. Descendants radiate out from a central ancestor,
and sequence 123 is in the direct lineage of sequence 583. In the polar
plot shown in Figure 3.2.2 we can drill down to the SNPs that drive the
relationships between sequences, highlighting the SNPs shared by
sequences 123 and 583 (red and orange SNPs) in contrast to the blue SNP
in sequence 51.
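The descent argument above can be expressed as a set comparison: under the assumption that SNPs do not back-mutate, a direct descendant carries every SNP of its ancestor plus one more. The Python below is a minimal sketch of that check; the function name and the SNP sets are hypothetical placeholders, not the actual SNPs of sequences 583, 123, or 51.

```python
def is_direct_descendant(child_snps, parent_snps):
    """True if the child carries every parent SNP plus exactly one new SNP."""
    child, parent = set(child_snps), set(parent_snps)
    return parent < child and len(child - parent) == 1


# Hypothetical SNP sets as (position, base) pairs relative to a reference.
index_case = {(100, "G"), (250, "T")}
candidate_a = {(100, "G"), (250, "T"), (400, "C")}  # inherits both, adds one
candidate_b = {(100, "G"), (610, "A")}              # lacks one index-case SNP

print(is_direct_descendant(candidate_a, index_case))
print(is_direct_descendant(candidate_b, index_case))
```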

Figure 3.2.1. Sunburst Plot Showing
Pedigree of Patients

Figure 3.2.2. Polar Plot
Highlighting Patient SNPs
MC3.3: Signs
and symptoms of the Drafa virus are varied and humans react differently
to infection. Some mutant strains from the current outbreak have
been reported as being worse than others for the patients that come in
contact with them.
Identify the top 3
mutations that lead to an increase in symptom severity (a disease
characteristic). The mutations involve one or more base
substitutions. For this question, the biological properties of
the underlying amino acid sequence patterns are not significant in
determining disease characteristics.
For each mutation
provide the base substitutions and their position in the sequence (left
to right) where the base substitutions occurred. For example,
C --> G, 456 (C changed to G at position 456)
G --> A, 513 and T --> A, 907 (G changed to A at position 513 and T
changed to A at position 907)
A --> G, 39 (A changed to G at position 39)
1. T --> C, 842 and A --> T, 946: p-value 2.9 x 10^-5 as a pair
2. A --> C, 269: p-value 0.0019
3. A --> G, 223: p-value 0.009
One bioinformatician spent two days examining the association between
SNPs and symptom severity using the Mann-Whitney U test (in R). The
three sets of significant SNPs numbered above are illustrated by the
patient groups circled in orange on the pedigree (Figure 3.3.1, created
with Protovis). Individual SNPs driving these clusters were visualized
using our polar plot (Figure 3.3.2, created with Processing). We chose
to combine SNPs 842 and 946 because the most severe cases had both
SNPs. Because no sample had SNP 946 alone, it was impossible to
determine whether 946 is more significant on its own or in concert with
842.
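The reasoning for pairing 842 and 946 rests on a simple co-occurrence tally: if one SNP never appears without the other, their effects cannot be separated. A minimal Python sketch of such a tally follows; the function name and patient data are hypothetical and do not reproduce the challenge counts.

```python
def cooccurrence(samples, snp_a, snp_b):
    """Count samples carrying both SNPs, only snp_a, only snp_b, or neither."""
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for snps in samples.values():
        has_a, has_b = snp_a in snps, snp_b in snps
        if has_a and has_b:
            counts["both"] += 1
        elif has_a:
            counts["only_a"] += 1
        elif has_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Made-up patients mapped to the SNP positions they carry.
samples = {"p1": {842, 946}, "p2": {842}, "p3": {842, 946}, "p4": {223}}
tally = cooccurrence(samples, 842, 946)
print(tally)
```

An `only_b` count of zero is the situation described above: the second SNP is never observed alone, so it can only be tested together with the first.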

Figure 3.3.1. Sunburst Plot
Highlighting Pedigree and Symptoms

Figure 3.3.2. Polar Plot Identifying
SNPs in Sequences
MC3.4:
Due to the rapid spread of the virus and limited resources, medical
personnel would like to focus on treatments and quarantine procedures
for the worst of the mutant strains from the current outbreak, not just
symptoms as in the previous question. To find the most dangerous
viral mutants, experts are monitoring multiple disease characteristics.
Consider each
virulence and drug resistance characteristic as equally
important. Identify the top 3 mutations that lead to the most
dangerous viral strains. The mutations involve one or more base
substitutions. In a worst case scenario, a very dangerous strain
could cause severe symptoms, have high mortality, cause major
complications, exhibit resistance to anti viral drugs, and target high
risk groups. For this question, the biological properties of the
underlying amino acid sequence patterns are not significant in
determining disease characteristics.
For each mutation
provide the base substitutions and their position in the sequence (left
to right) where the base substitutions occurred. For example,
C --> G, 456 (C changed to G at position 456)
G --> A, 513 and T --> A, 907 (G changed to A at position 513 and T
changed to A at position 907)
A --> G, 39 (A changed to G at position 39).
The 3 mutations that lead to the most dangerous viral strains were:
1. T --> C, 842 and A --> T, 946: p-value 0.0004
2. A --> G, 223: p-value 0.0015
3. A --> C, 269: p-value 0.0076
Throughout the genomics
challenge, visualizations and statistics have driven our analysis. The
outbreak sequences were aligned using Clustal W to identify the origin
of the outbreak. SNUFER was used to cluster the sequences and automate
the generation of SNP tables based on divergence from Nigeria_B. By
transforming and displaying the data in novel ways, we quickly found
that only 57 out of the 1404 total nucleotides in this sequence had a
SNP present in at least one sample. Perl was used to sort the data and
select the 21 SNPs that occurred in at least two, but not all, outbreak
samples.
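The filtering rule (keep SNPs seen in at least two samples but not in every sample) is easy to state in code. We did this step in Perl; the Python below is only an illustrative sketch with a hypothetical function name and made-up data.

```python
def informative_snps(snp_table, min_samples=2):
    """Return sorted SNP positions seen in >= min_samples samples but not in all."""
    n_samples = len(snp_table)
    counts = {}
    for snps in snp_table.values():
        for pos in set(snps):
            counts[pos] = counts.get(pos, 0) + 1
    return sorted(pos for pos, c in counts.items() if min_samples <= c < n_samples)


# Made-up table: sample name -> SNP positions that differ from the reference.
table = {"s1": {10, 20, 30}, "s2": {10, 20}, "s3": {10, 40}}
kept = informative_snps(table)
print(kept)
```

Position 10 is dropped because every sample carries it, and positions 30 and 40 are dropped because each occurs only once.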
We used this filtered dataset for further analysis in Excel. Figure
3.4.1 shows a view of our Excel table in which high frequency SNPs
(those occurring in more than 4 samples) are colored black in vertical
bars and each patient's overall severity score (ranging from 1-8) is
colored horizontally from green to red (green for 1, yellow for 4, and
red for 8). To arrive at the overall severity score, we scored each
categorical disease characteristic from 0-2 (Complications, a binary
variable, was scored as 0 or 2). We then summed the scores for all
characteristics for each patient.
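The scoring scheme can be sketched as follows. The 0-2 scale and the 0-or-2 rule for Complications come from the description above; the field names and the "low"/"medium"/"high" labels are hypothetical stand-ins, since the actual category labels in the challenge data are not reproduced here.

```python
# Hypothetical level labels mapped onto the 0-2 scale described above.
LEVEL_SCORES = {"low": 0, "medium": 1, "high": 2}


def overall_severity(record):
    """Sum per-characteristic scores; the binary 'complications' field scores 0 or 2."""
    total = 0
    for name, value in record.items():
        if name == "complications":
            total += 2 if value else 0
        else:
            total += LEVEL_SCORES[value]
    return total


# Made-up patient record with five characteristics.
patient = {
    "symptoms": "high",
    "mortality": "medium",
    "complications": True,
    "drug_resistance": "low",
    "at_risk": "medium",
}
score = overall_severity(patient)  # 2 + 1 + 2 + 0 + 1
print(score)
```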
Colorizing our tables immediately illustrated correlations between
particular SNPs and severity. From these color patterns it was easy to
see that some SNPs, like SNP 161, occur frequently, but in both severe
and non-severe cases. Other SNPs, such as SNP 223, occur only in severe
cases. Finally, some SNPs, such as SNP 22, seem to moderate disease
severity. We made one major assumption in this analysis: once a SNP is
present, it would rarely if ever back-mutate. We therefore focused only
on SNPs that are divergent from Nigeria_B. All of these initial tasks,
including the statistics below, required one bioinformatician
approximately 40 hours to complete.

Figure 3.4.1. Excel Table of SNPs by
Patient Showing Symptom Severity and High Frequency SNPs
We next statistically verified which mutations were responsible for the
most dangerous viral strains. Using our overall severity score, we
tested the significance of individual SNPs with the Mann-Whitney U test
in R. For each SNP, we created two vectors of overall severity scores:
one for patients with the SNP and one for patients without it. We found
five mutations to be statistically significant (Figure 3.4.2). Two of
these, 842 and 946, we had previously determined to be highly
correlated, and we concluded that the two together are responsible for
overall disease severity. SNP 790 is slightly correlated with SNP 223
and has a less significant p-value, so we eliminated it from further
consideration here.
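For readers without R, the per-SNP test can be sketched in plain Python. This is an assumption-laden illustration, not our R code: it uses midranks for ties and a tie-corrected normal approximation with a continuity correction, so its p-values may differ slightly from an exact test. The function name and the severity vectors are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via a tie-corrected normal approximation.

    Returns (U statistic for x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    # Continuity correction of 0.5 toward the mean, floored at zero.
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return u, min(math.erfc(z / math.sqrt(2.0)), 1.0)


# Made-up severity scores: patients with a SNP versus patients without it.
with_snp = [7, 8, 6, 8]
without_snp = [2, 3, 1, 4, 2]
u_stat, p_value = mann_whitney_u(with_snp, without_snp)
print(u_stat, p_value)
```

Running this once per SNP, with the two vectors rebuilt each time, mirrors the procedure described above.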

Figure 3.4.2. P-values of SNPs
Associated With Disease Severity
Statistics generally lack a visual punchline. We therefore wanted to
develop new approaches to visualizing genetic data that illustrate
relationships among strains and highlight the SNPs that drive these
relationships. The plots we developed are interactive and allow users
to focus on particular groupings of SNPs. When we first developed polar
plots for MC3.2, we realized that traditional phylogenetic trees did
not clearly show patient groupings. As an alternative, we decided to
use pedigree plots to illustrate direct ancestry. Most pedigree
software packages track inheritance of genetic disorders with
parent-child data and do not adapt well to single “parent” viral data.
We therefore initially created pedigree trees manually using Microsoft
Visio (Figure 3.4.3). Although these plots provide an alternative way
to look at genetic data, they do not scale well horizontally and are
difficult to render manually for more than a few dozen patients, so we
wanted to automate them.

Figure 3.4.3. Pedigree Chart of
Patients Colored By Overall Disease Severity
We used Protovis to automate drawing pedigree plots. This toolkit
allowed us to create the interactive visualization shown in Figure
3.4.4. The pedigree is now drawn as a circular sunburst chart to
increase scalability. The script runs as a web application with a
drop-down box at the bottom that lets the user select among the five
disease characteristics, each displayed with a different color scheme,
or color the plot by our overall severity scale. The plot illustrates
the hierarchical relationships among patients and displays clusters of
patients with similar severity characteristics. It can be rendered from
tabular data and could easily be developed further into a web tool that
would allow users to import their own data for visualization.
Recoloring the plot took one web application developer approximately 1
hour.
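Rendering a pedigree "from tabular data" generally means turning child/parent rows into a nested hierarchy that a sunburst-style layout can walk. The Python below is a hedged sketch of that conversion only; it is not our Protovis script, and the function name, the "name"/"children" keys, and the toy rows are all hypothetical.

```python
def build_tree(parent_of):
    """Convert {child: parent} rows into nested dicts; the root has parent None."""
    children = {}
    root = None
    for child, parent in parent_of.items():
        children.setdefault(child, [])
        if parent is None:
            root = child
        else:
            children.setdefault(parent, []).append(child)
    if root is None:
        raise ValueError("no root row (parent None) found")

    def nest(node):
        return {"name": node, "children": [nest(c) for c in sorted(children[node])]}

    return nest(root)


# Made-up pedigree rows: each sequence has a single "parent" sequence.
rows = {"root": None, "a": "root", "b": "root", "c": "a"}
tree = build_tree(rows)
print(tree)
```

Each nesting level would correspond to one ring of the sunburst, with the root ancestor at the center.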

Figure 3.4.4. Sunburst Plot Showing
Pedigree of Patients Colored by Overall Severity
We developed a second visualization to highlight individual SNPs and
their relationships to disease characteristics. We used Processing to
make a polar plot for interactively viewing and analyzing all SNPs
simultaneously. Figure 3.4.5 shows this plot displaying the outbreak
patients on the radii, with all 57 SNPs, in order of their location in
the sequence, represented in concentric inner circles. The plot can be
colorized by any disease characteristic, but is colored here to
illustrate overall severity. The radii are colored by severity score to
match the sunburst plot, and the SNPs significantly associated with
overall severity are also colored. The visualization is interactive: on
mouse-over, the left panel shows the current patient ID, SNP changes
(and p-values where applicable), and the disease characteristics (with
severe characteristics highlighted in red). Groups of SNPs (like the
SNPs shown in red) are easily identified, and different SNPs contribute
to separate branches of severe disease. This plot took one
bioinformatician and one web application developer approximately 16
hours to develop.
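The geometry of such a polar plot reduces to assigning each patient an angle and each SNP a ring radius. The Python below sketches that mapping under assumed parameters; it is not our Processing sketch, and the function name, `inner_radius`, and `ring_gap` are hypothetical.

```python
import math


def polar_layout(n_patients, n_snps, inner_radius=50.0, ring_gap=5.0):
    """Return {(patient_index, snp_index): (x, y)} for a polar SNP plot.

    Patients are spread evenly around the circle; SNP k sits on ring k,
    counted outward from inner_radius in steps of ring_gap.
    """
    points = {}
    for p in range(n_patients):
        angle = 2.0 * math.pi * p / n_patients
        for s in range(n_snps):
            r = inner_radius + s * ring_gap
            points[(p, s)] = (r * math.cos(angle), r * math.sin(angle))
    return points


pts = polar_layout(n_patients=4, n_snps=3)
print(pts[(0, 0)], pts[(1, 2)])
```

Mouse-over lookup can then be done by converting the cursor position back to an angle and radius and rounding to the nearest patient and ring.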

Figure 3.4.5. Polar Plot Showing
Significant SNPs Colored by Overall Disease Severity